What follows is an analysis of applicant data for Donors Choose, an educational granting organization located in the United States. Donors Choose provides educational funding for schools and teachers alike. Teachers are first required to submit an application, and applications are either approved or denied.
For this project, we begin with various data manipulations and feature engineering. These manipulations include target encoding, natural language processing, and other, more rudimentary operations, such as recoding and numericizing variables. Second, we create various data visualizations to build a thorough understanding of the data at hand. These visualizations are, in large part, descriptive, and, as such, deal with descriptive statistics. Third, we conduct various analyses of the data set (predictive modeling) in order to gain greater insight into what accounts for approval of a given application. (Our target variable is binary: approved or not approved, 1 or 0.)
Note on the target variable: one of the main challenges this data set poses for predictive modeling is that the target variable is highly imbalanced. In other words, most applications are approved. Because the pool of approved applicants far outnumbers the pool of denied applicants, it is difficult to find meaningful distinctions between the two groups.
How are we to create an accurate, meaningful predictive model when over 80% accuracy is achieved by merely predicting ‘approved’ for all new applications? Further, and more importantly, because the bulk of the data set deals with approved applications, we are working with very limited information about denied applications. (This might seem a trivial observation, but it is quite important: trends apparent in the data set are, for the most part, merely trends in approved applications, and this can lead to misleading results.) How are we to distinguish between two objects when one object is, in relative terms, largely unknown? Of course, this conundrum is characteristic of classification problems, and of logic more generally.
Anyhow, this kernel was largely inspired by Bukun, and even more so by Heads or Tails. Indeed, this kernel might best be considered the ‘Variation on a Theme by Heads or Tails’ – more or less.
Finally, as our evaluation metric, AUC (area under the ROC curve) is employed.
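To make the class-imbalance point concrete, here is a minimal base-R sketch (toy labels, not the actual data) showing that a model predicting ‘approved’ for everything achieves high accuracy yet an AUC of exactly 0.5:

```r
set.seed(1)
y <- rbinom(1000, 1, 0.85)   # toy labels: ~85% approved, mimicking the imbalance
pred <- rep(1, length(y))    # 'model' that approves every application

accuracy <- mean(pred == y)  # high, despite the model learning nothing
accuracy                     # roughly 0.85

# AUC via the rank-sum (Mann-Whitney) identity; constant scores all tie in rank,
# so a constant classifier scores exactly 0.5
auc <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(pred, y)                 # 0.5
```

This is why AUC, which is insensitive to the base rate, is a more honest yardstick here than raw accuracy.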
To begin, we load libraries pertinent to data manipulation and data visualization in R. Tidyverse contains a host of tools useful for feature engineering and data cleansing. The car package has two uses: the first is recoding inputs, and the second is statistical analysis, as demonstrated below. Lubridate is useful for manipulating dates.
Having loaded preliminary libraries, we read in the first of our data sets. (Note that we will begin with two data sets for our initial analyses.)
Using ‘glimpse’ from the tidyverse, we can get a better feel for our data. (We begin with the data set entitled ‘train’ and deal with ‘resources’ further below.) More specifically, we discover the data set we’re dealing with is as interesting as it is varied: we have numerical inputs, no factors, and quite a few character vectors. Some of these variables need recoding. For example, teacher_prefix is stored as a character vector when in fact it should be a factor. Again, we will deal with this below.
## Observations: 182,080
## Variables: 16
## $ id <chr> "p036502", "p039565…
## $ teacher_id <chr> "484aaf11257089a66c…
## $ teacher_prefix <chr> "Ms.", "Mrs.", "Ms.…
## $ school_state <chr> "NV", "GA", "UT", "…
## $ project_submitted_datetime <dttm> 2016-11-18 14:45:5…
## $ project_grade_category <chr> "Grades PreK-2", "G…
## $ project_subject_categories <chr> "Literacy & Languag…
## $ project_subject_subcategories <chr> "Literacy", "Perfor…
## $ project_title <chr> "Super Sight Word C…
## $ project_essay_1 <chr> "Most of my kinderg…
## $ project_essay_2 <chr> "I currently have a…
## $ project_essay_3 <chr> NA, NA, NA, NA, NA,…
## $ project_essay_4 <chr> NA, NA, NA, NA, NA,…
## $ project_resource_summary <chr> "My students need 6…
## $ teacher_number_of_previously_posted_projects <dbl> 26, 1, 5, 16, 42, 0…
## $ project_is_approved <dbl> 1, 0, 1, 0, 1, 1, 1…
Most of the code here is relatively straightforward and, again, constitutes variation on an existing theme. Of interest: prefixes like ‘Mr.’, ‘Mrs.’, and ‘Ms.’ were recoded as ‘Male’ and ‘Female’ (‘M’ and ‘F’). This recoding decreases cardinality. Granted, cardinality is hardly a problem in the case of teacher_prefix (as opposed to, say, school_state); nonetheless, the encoding allows us to deal with this feature from a simpler, and therefore more meaningful, standpoint.
train <- train %>%
mutate(teacher_id = as.factor(teacher_id),
teacher_prefix = as.factor(teacher_prefix),
school_state = as.factor(school_state),
project_submitted_datetime = ymd_hms(project_submitted_datetime),
day = ymd(str_sub(as.character(project_submitted_datetime),1,10)),
project_grade_category = as.factor(project_grade_category),
project_grade_category = fct_relevel(project_grade_category, "Grades PreK-2", after = 0),
project_subject_categories = as.factor(project_subject_categories),
project_subject_subcategories = as.factor(project_subject_subcategories),
project_is_approved = as.logical(project_is_approved),
sex = recode(teacher_prefix,
"'Mr.'='M'; 'Mrs.'='F'; 'Ms.'='F'; 'Teacher'='F'; 'Dr.'='F'"),
previous_projects=as.numeric(teacher_number_of_previously_posted_projects))
We continue with our initial data manipulations. Of interest: we engineer two new features, ‘experience’ and ‘special_needs_and_interdisciplinary’. ‘Experience’ captures whether teachers have previously submitted proposals for funding. As seen in the chunk below, a ‘Novice’ is a first-time applicant, a ‘Beginner’ is a second-time applicant, and an ‘Experienced’ applicant has previously applied more than once. These thresholds pool applicants by number of previous applications. In brief, it was thought that experienced applicants would have a higher likelihood of approval than those who had never previously applied.
The feature ‘special_needs_and_interdisciplinary’ gets at the following: some of the applications were interdisciplinary in nature (for example, some requested funding for projects that hybridized science and mathematics); others were uni-disciplinary; and finally, a proportion of the projects were related to special needs instruction. In the ‘Significance Testing’ section below, it is seen that these groups are in fact significant predictors of approval, and it therefore seems reasonable to include them in our model.
train <- train %>%
mutate(
experience = recode(previous_projects,
"c(0)='Novice';
c(1)='Beginner';
else='Experienced'"),
specialneeds = str_detect(project_subject_categories,"Special Needs"),
specialneeds = as.numeric(specialneeds),
interdisciplinary = str_detect(project_subject_categories, ", "),
interdisciplinary = as.numeric(interdisciplinary),
special_needs_and_inter = specialneeds + interdisciplinary,
special_needs_and_interdisciplinary =
recode(special_needs_and_inter,
"c(0)='Uni-Disciplinary';
c(1)='Interdisciplinary Studies';
else='Special Needs'"),
specialneeds=NULL,
interdisciplinary=NULL,
special_needs_and_inter=NULL)
Below are yet more data manipulations. These are relatively straightforward. In brief, the date in the data frame has the format YYYY-MM-DD: December 1st, 2017, for example, reads as 2017-12-01. From this, we extract the day of the week the application was submitted, along with the month of the year, to see whether weekday or month had any effect on the likelihood of approval. Preliminary significance testing suggests that these features are well worth including, as they are statistically significant when controlling for other features.
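As a sanity check on the extraction below, base R’s date accessors behave as follows on a single known date (a toy example; lubridate’s month() and year() behave analogously):

```r
d <- as.Date("2017-12-01")   # a known Friday
weekdays(d)                  # day-of-week name (locale-dependent), e.g. "Friday"
months(d)                    # month name (locale-dependent), e.g. "December"
as.integer(format(d, "%m"))  # month number: 12
format(d, "%u")              # ISO weekday number (Monday = 1): "5"
```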
train$weekday = weekdays(train$project_submitted_datetime)
train$monthname =
month(train$project_submitted_datetime, label = FALSE, abbr = FALSE)
train$weekday = as.factor(train$weekday)
train$monthname = as.factor(train$monthname)
train$year = year(train$project_submitted_datetime)
train$year = as.factor(train$year)
Our first visualization shows the frequency of submission as it relates to time of year. Notice that there are spikes in the rate of submission near the end of the summer (or the beginning of the school year) and around Spring Break. This makes intuitive sense: teachers have more free time during these periods; further, they are likely preparing for upcoming instructional modules.
p1 <- train %>%
count(day) %>%
ggplot(aes(day, n)) +
geom_line(col = "darkblue") +
labs(x = "Date", y = "Submitted Proposals",
title = "Submission Date and Count",
subtitle = "Irrespective of Acceptance")
p1
Here, we subset our data into two sections: (1) applications approved and (2) applications denied. We then plot the disparate data sets to get a better feel for the rate of submission as it relates to approval and denial.
train$project_is_approved=as.numeric(train$project_is_approved)
applications_approved = filter(train, project_is_approved > 0)
applications_denied = filter(train, project_is_approved < 1)
Our first plot deals exclusively with approved applications. Note that the spikes in rate of submission for approved projects map neatly onto the overall rate of submission seen above, prior to our partitioning. This is undoubtedly due to the imbalanced nature of the data set: the large majority of applications were approved; therefore, approved applications drive most trends in the data. This is important to note, as it is a difficulty that must be dealt with at some point. The ‘To Do’ of all ‘To Dos’.
p2 <- applications_approved %>%
count(day) %>%
ggplot(aes(day, n)) +
geom_line(col = "darkred", alpha = 0.8) +
labs(x = "Date", y = "Accepted Proposals",
title = "Submission Date and Count",
subtitle = "Accepted")
p2
Below we plot denied applications in the foreground in blue, and accepted applications in the background in red. Note that spikes in frequency of submission are much less pronounced for denied applications.
p3 <- applications_denied %>%
count(day) %>%
ggplot(aes(day, n)) +
geom_line(col = "darkblue", alpha = 0.8) +
labs(x = "Date", y = "Denied Proposals",
title = "Submission Date and Count",
subtitle = "Denied")
z =
applications_approved %>%
count(day)
p4 = p3 +
geom_line(data = z, aes(day, n), col = "darkred", alpha = 0.4) +
labs(x = "Date", y = "Proposals: Denied and Accepted",
title = "Submission Date and Count",
subtitle = "Denied (Blue) Over Accepted (Red)")
p4
A hidden code chunk exists immediately above, one used for the visualization below; that is, below you will find a matrixed representation of the visualizations presented above. Though redundant, I thought to include it for purely aesthetic purposes. The code can be found in the R Cookbook (the multiplot function), or in the code section of this kernel (see below).
We continue with multi-colored visualizations, first focusing on submissions per state. Note that the rate of submission for California and New York far exceeds that of the other states. This is interesting, particularly with regard to California, and worth continued investigation.
p5 = ggplot(data = train,
aes(school_state, fill = school_state)) +
geom_bar() +
theme(legend.position = "none", axis.text.x = element_text(angle=60, hjust=1, vjust=0.9)) +
labs(x = "State", y = "Frequency of Submissions",
title = "Frequency of Submissions by State")
p5
Below: a visualization of the frequency of submission as it pertains to the engineered feature, ‘sex’. Interestingly, the rate of submission for females far exceeds the rate of submission for males, and the feature is statistically significant in terms of likelihood of approval (please refer to the ‘Significance Testing’ section below for greater detail). Similarly, state of submission is a significant predictor of approval.
p6 = ggplot(data = train, aes(x = sex)) +
geom_bar(fill = "darkblue") +
theme(legend.position = "none", axis.text.x = element_text(angle=60, hjust=1, vjust=0.9)) +
labs (y = "Count", x = "Sex",
title = "Count of Submissions by Female and Male",
subtitle= "Archaic Categorization")
p6
Below is yet another visualization, this one dealing with the engineered feature ‘experience’ as it relates to frequency of submission. We see a disproportionate number of applicants have previously applied for funding from Donors Choose.
p7 = ggplot(data = train, aes(x = experience)) +
geom_bar(fill = "darkred") +
theme(legend.position = "none", axis.text.x = element_text(angle=60, hjust=1, vjust=0.9)) +
labs (y = "Count", x = "Experience",
title = "Count of Submissions by Experience",
subtitle= "A Feature, Engineered")
p7
Below are continued data manipulations. Various features are removed due to redundancy. For example, the information contained in teacher_prefix (i.e., Mr., Ms., etc.) is contained in the engineered feature ‘sex’. Accordingly, we drop the teacher_prefix feature.
train <- train %>%
mutate(teacher_prefix = NULL,
teacher_number_of_previously_posted_projects=NULL, #-- as previous projects
project_submitted_datetime=NULL, #-- not necessary now that we have our date
X=NULL, #-- not informative
teacher_id=NULL,
experience = as.factor(experience),
special_needs_and_interdisciplinary=as.factor(special_needs_and_interdisciplinary),
project_is_approved=as.logical(project_is_approved))
Below is yet another feature, engineered. The logic undergirding the engineering is as follows: many of the ‘project_subject_categories’ are two-tiered; for example, a teacher might submit an application for ‘Math and Science: Applied Learning’. This two-tiered structure results in high cardinality and redundancy. The feature engineering below attempts to remove redundancy and reduce cardinality while retaining pertinent information. Significance testing below suggests that these features may indeed improve predictive power.
train <- train %>%
mutate(subject_numbered = as.numeric(project_subject_categories),
subject_broad = recode(subject_numbered,
"c(1, 2, 3, 4, 5, 6, 7)='Applied Learning';
c(8, 9, 10, 11, 12, 13)='Health Sports';
c(14, 15, 16, 17, 18, 19)='History Civics';
c(20, 21, 22, 23, 24, 25)='Literacy and Language';
c(26, 27, 28, 29, 30, 31, 32)='Math and Science';
c(33, 34)='Music and Art';
c(35, 36)='Special Needs';
else='Warmth Care Hunger'"),
subject_broad=as.factor(subject_broad))
train <- train %>%
mutate(subject_numbered = NULL)
Below is a technicolor visualization of the feature engineered above.
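One caveat worth flagging about the chunk above: it keys off as.numeric(project_subject_categories), i.e., the factor’s level index, and factor levels default to alphabetical order. A minimal illustration (toy levels, not the actual categories):

```r
f <- factor(c("Health & Sports", "Applied Learning", "Health & Sports"))
levels(f)      # alphabetical: "Applied Learning" "Health & Sports"
as.numeric(f)  # level indices, not original element order: 2 1 2
```

The hard-coded index ranges are therefore only valid for this particular set of levels; adding or removing a category would silently shift every index.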
p8 = ggplot(data = train) +
geom_bar(aes(x = subject_broad, fill = subject_broad)) +
theme(legend.position = "none", axis.text.x = element_text(angle=60, hjust=1, vjust=0.9)) +
labs (y = "Count", x = "Broad Subject",
title = "Count of Submissions by Broad Subject",
subtitle= "A Feature, Engineered")
p8
For predictive modeling, significance testing of the sort found directly below will likely seem trivial, superfluous, even silly. To my mind, significance testing can be used to assess whether the features engineered above, on a very basic level, have the potential to improve model performance. In short, a significant test statistic (p < 0.05) suggests we are on the right track and would do well to include the feature, as it may improve model performance. Of course, this is no guarantee: statistically significant features often fail to improve predictive accuracy, and, as can be seen below, the models are simple in the extreme. (They are used chiefly to get a fingertip feel for feature importance, as it were.)
Below are a few tests assessing the significance of the features engineered above. Note: I have refrained from outputting coefficients and exponentiated coefficients aside from the first test. They are not germane to the sort of predictive modeling that is the aim of this project. Nonetheless, as seen immediately below, due to their simplicity, these models are extremely interpretable, and for this reason have strengths other models do not.
fit_broad_subject_category =
glm(project_is_approved ~ subject_broad,
data = train,
family = binomial)
Anova(fit_broad_subject_category, type = 3, test.statistic = "F")
## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## subject_broad 476 7 67.991 < 2.2e-16 ***
## Residuals 182080 182072
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## glm(formula = project_is_approved ~ subject_broad, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0175 0.5292 0.5430 0.5883 0.6264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.535030 0.020209 75.957 < 2e-16
## subject_broadHealth Sports 0.173649 0.028485 6.096 1.09e-09
## subject_broadHistory Civics 0.133168 0.041201 3.232 0.00123
## subject_broadLiteracy and Language 0.360016 0.024759 14.541 < 2e-16
## subject_broadMath and Science 0.304636 0.025290 12.046 < 2e-16
## subject_broadMusic and Art -0.006093 0.025128 -0.242 0.80841
## subject_broadSpecial Needs 0.035890 0.065900 0.545 0.58601
## subject_broadWarmth Care Hunger 0.131344 0.025928 5.066 4.07e-07
##
## (Intercept) ***
## subject_broadHealth Sports ***
## subject_broadHistory Civics **
## subject_broadLiteracy and Language ***
## subject_broadMath and Science ***
## subject_broadMusic and Art
## subject_broadSpecial Needs
## subject_broadWarmth Care Hunger ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 155390 on 182079 degrees of freedom
## Residual deviance: 154914 on 182072 degrees of freedom
## AIC: 154930
##
## Number of Fisher Scoring iterations: 4
## (Intercept) subject_broadHealth Sports
## 4.6414651 1.1896384
## subject_broadHistory Civics subject_broadLiteracy and Language
## 1.1424417 1.4333522
## subject_broadMath and Science subject_broadMusic and Art
## 1.3561310 0.9939254
## subject_broadSpecial Needs subject_broadWarmth Care Hunger
## 1.0365423 1.1403604
From the results immediately above, we see that many of the categories created are in fact significant predictors of project approval. But again, this model is so simple as to be almost useless for the purposes of prediction within the context of this project. That being said, if we were interested in examining ‘Broad Subject Category’ as a fixed effect, we could undoubtedly make an argument otherwise. More specifically, we could claim that submitting an application under the heading ‘Literacy and Language’, for example, increases the odds of approval by roughly 43% (an odds ratio of about 1.43). In all, the significance test suggests that this feature may improve the accuracy of our model in terms of AUC. We continue with a few more significance tests, simply to ascertain whether it seems reasonable to include the features in our models below.
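The 43% figure falls straight out of exponentiating the fitted log-odds coefficient shown in the summary above:

```r
b <- 0.360016       # log-odds coefficient for 'Literacy and Language' (from the output above)
exp(b)              # odds ratio: ~1.4334, matching the exponentiated coefficients printed above
100 * (exp(b) - 1)  # ~43.3% increase in the odds of approval
```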
fit_sex =
glm(project_is_approved ~ sex,
data = train,
family = binomial)
Anova(fit_sex, type = 3, test.statistic = "F")
## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## sex 5 1 4.8063 0.02836 *
## Residuals 182076 182074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit_experience =
glm(project_is_approved ~ experience,
data = train,
family = binomial)
Anova(fit_experience, type = 3, test.statistic = "F")
## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## experience 618 2 308.84 < 2.2e-16 ***
## Residuals 182080 182077
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit_special_needs =
glm(project_is_approved ~ special_needs_and_interdisciplinary,
data = train,
family = binomial)
Anova(fit_special_needs, type = 3, test.statistic = "F")
## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## special_needs_and_interdisciplinary 14 2 7.2154 0.0007354 ***
## Residuals 182080 182077
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit_weekday =
glm(project_is_approved ~ weekday,
data = train,
family = binomial)
Anova(fit_weekday, type = 3)
## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## LR Chisq Df Pr(>Chisq)
## weekday 26.026 6 0.0002202 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit_month =
glm(project_is_approved ~ monthname,
data = train,
family = binomial)
Anova(fit_month, type = 3, test.statistic = "F")
## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## monthname 290 11 26.325 < 2.2e-16 ***
## Residuals 182080 182068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results above indicate that experience, monthname, weekday, special_needs_and_interdisciplinary, and sex are all significant predictors of project approval.
Before turning our attention to the data set ‘resources’, we conduct a few more manipulations. Due to the high cardinality of the features ‘project_subject_subcategories’ and ‘school_state’, we carry out a bit of target encoding. We then proceed to the ‘resources’ data set. As seen below, many other features are created from that data set, many of them inspired by Bukun. These features are then joined to the ‘train’ data set, at which point we do a bit of language processing before modeling.
# -- 1. Target Encode Project SubCategory, Mean
subcategory_average = train %>%
group_by(project_subject_subcategories) %>%
dplyr::summarise(SubcategoryAverage = mean(project_is_approved))
# -- 2. Join New Feature to Train Data Set
train =
left_join(train,subcategory_average %>%
select (SubcategoryAverage,
project_subject_subcategories),by="project_subject_subcategories")
# -- 3. Target Encode Project SubCategory, Standard Deviation
subcategory_sd = train %>%
group_by(project_subject_subcategories) %>%
dplyr::summarise(SubcategorySD = sd(project_is_approved))
# -- 4. Join New Feature to Train Data Set
train =
left_join(train,subcategory_sd %>%
select (SubcategorySD,
project_subject_subcategories),by="project_subject_subcategories")
# -- 5. Target Encode school_state, mean
state_average = train %>%
group_by(school_state) %>%
dplyr::summarise(StateAverage = mean(project_is_approved))
# -- 6. Join New Feature to Train Data Set
train =
left_join(train,state_average %>%
select (StateAverage, school_state),by="school_state")
# -- 7. Create Total Amount Variable
resources = resources %>%
mutate(TotalAmount = (quantity*price))
# -- 8. Create New Feature and Join to Train Data Set
resources2 = resources %>%
group_by(id) %>%
dplyr::summarise(TotalAmountPerID = sum(TotalAmount))
train =
left_join(train,resources2 %>%
select(TotalAmountPerID,id),by = "id")
# -- 9. Create New Feature and Join to Train Data Set
resources_median_amount = resources %>%
group_by(id) %>%
dplyr::summarise(MedianAmount = median(TotalAmount))
train =
left_join(train,resources_median_amount %>%
select(MedianAmount,id),by = "id")
# -- 10. Create New Feature and Join to Train Data Set
resources_min_total = resources %>%
group_by(id) %>%
dplyr::summarise(MinTotal = min(TotalAmount))
train =
left_join(train,resources_min_total %>%
select(MinTotal,id),by = "id")
# -- 11. Create New Feature and Join to Train Data Set
resources_max_total = resources %>%
group_by(id) %>%
dplyr::summarise(MaxTotal = max(TotalAmount))
train =
left_join(train,resources_max_total %>%
select(MaxTotal,id),by = "id")
# -- 12. Create New Feature and Join to Train Data Set
resources_quantity = resources %>%
group_by(id) %>%
dplyr::summarise(TotalQty = sum(quantity))
train =
left_join(train,resources_quantity %>%
select(TotalQty,id),by = "id")
# -- 13. Create New Feature and Join to Train Data Set
resources_count = resources %>%
group_by(id) %>%
dplyr::summarise(CountItems = n())
train =
left_join(train,resources_count %>%
select(CountItems,id),by = "id")
# -- 14. Create New Feature and Join to Train Data Set
resources_max_price = resources %>%
group_by(id) %>%
dplyr::summarise(MaxPrice = max(price))
train =
left_join(train,resources_max_price %>%
select(MaxPrice,id),by = "id")
# -- 15. Create New Feature and Join to Train Data Set
resources_mean_price = resources %>%
group_by(id) %>%
dplyr::summarise(MeanPrice = mean(price))
train =
left_join(train,resources_mean_price %>%
select(MeanPrice,id),by = "id")
# -- 16. Create New Feature and Join to Train Data Set
resources_min_price = resources %>%
group_by(id) %>%
dplyr::summarise(MinPrice = min(price))
train =
left_join(train,resources_min_price %>%
select(MinPrice,id),by = "id")
# -- 17. Add Project Description to Train Data Set
project_description = resources %>%
group_by(description) %>%
dplyr::select(description, id)
train =
left_join(train,project_description, by = "id")
Some might question the wisdom of engineering myriad features. As seen below, quite a few have near-zero variance and, for this reason, do not improve predictive power. But again, because we are dealing with a highly imbalanced target variable – because the group of approved applications far outnumbers the group of denied applications – at this point, prior to having conducted significance tests, it seems reasonable to engineer as many features as possible. Our criterion for engineering: if there is even a small chance a feature may help us cleave one group from the other, we create it. (Admittedly, this criterion is open to debate and might strike some as far too lax.) For example, perhaps the total amount of funding requested for denied projects differs significantly from the total amount requested for approved projects; accordingly, total amount requested seems an important feature to include.
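The summarise-then-join pattern repeated above reduces the many-rows-per-id ‘resources’ table to one aggregate per id before attaching it to ‘train’. A base-R sketch on a toy table (hypothetical values) shows the idea; aggregate() and merge() stand in for dplyr’s group_by/summarise and left_join:

```r
# toy stand-in for the 'resources' table: two line items for p1, one for p2
resources_toy <- data.frame(
  id       = c("p1", "p1", "p2"),
  quantity = c(2, 1, 5),
  price    = c(10, 40, 3)
)
resources_toy$TotalAmount <- resources_toy$quantity * resources_toy$price

# collapse to one row per id (dplyr's group_by + summarise)
totals <- aggregate(TotalAmount ~ id, resources_toy, sum)

# left join onto a toy train table (dplyr's left_join)
train_toy <- data.frame(id = c("p1", "p2"), project_is_approved = c(1, 0))
merge(train_toy, totals, by = "id", all.x = TRUE)  # p1 -> 60, p2 -> 15
```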
Anyhow, we now take a quick glimpse at our data set to see what we’re working with, and to make sure that everything looks okay.
## Observations: 1,081,830
## Variables: 34
## $ id <chr> "p036502", "p036502", "p0395…
## $ school_state <fct> NV, NV, GA, UT, NC, NC, NC, …
## $ project_grade_category <fct> Grades PreK-2, Grades PreK-2…
## $ project_subject_categories <fct> "Literacy & Language", "Lite…
## $ project_subject_subcategories <fct> "Literacy", "Literacy", "Per…
## $ project_title <chr> "Super Sight Word Centers", …
## $ project_essay_1 <chr> "Most of my kindergarten stu…
## $ project_essay_2 <chr> "I currently have a differen…
## $ project_essay_3 <chr> NA, NA, NA, NA, NA, NA, NA, …
## $ project_essay_4 <chr> NA, NA, NA, NA, NA, NA, NA, …
## $ project_resource_summary <chr> "My students need 6 Ipod Nan…
## $ project_is_approved <lgl> TRUE, TRUE, FALSE, TRUE, FAL…
## $ day <date> 2016-11-18, 2016-11-18, 201…
## $ sex <fct> F, F, F, F, M, M, M, M, M, M…
## $ previous_projects <dbl> 26, 26, 1, 5, 16, 16, 16, 16…
## $ experience <fct> Experienced, Experienced, Be…
## $ special_needs_and_interdisciplinary <fct> Uni-Disciplinary, Uni-Discip…
## $ weekday <fct> Friday, Friday, Wednesday, S…
## $ monthname <fct> 11, 11, 4, 1, 8, 8, 8, 8, 8,…
## $ year <fct> 2016, 2016, 2017, 2017, 2016…
## $ subject_broad <fct> Literacy and Language, Liter…
## $ SubcategoryAverage <dbl> 0.8840571, 0.8840571, 0.7142…
## $ SubcategorySD <dbl> 0.3201666, 0.3201666, 0.4629…
## $ StateAverage <dbl> 0.8567697, 0.8567697, 0.8340…
## $ TotalAmountPerID <dbl> 899.94, 899.94, 400.00, 469.…
## $ MedianAmount <dbl> 449.970, 449.970, 400.000, 4…
## $ MinTotal <dbl> 449.97, 449.97, 400.00, 469.…
## $ MaxTotal <dbl> 449.97, 449.97, 400.00, 469.…
## $ TotalQty <dbl> 6, 6, 20, 1, 5, 5, 5, 5, 5, …
## $ CountItems <int> 2, 2, 1, 1, 5, 5, 5, 5, 5, 1…
## $ MaxPrice <dbl> 149.99, 149.99, 20.00, 469.9…
## $ MeanPrice <dbl> 149.990000, 149.990000, 20.0…
## $ MinPrice <dbl> 149.99, 149.99, 20.00, 469.9…
## $ description <chr> "Apple - iPod nano� 16GB MP3…
Everything looks okay. The only thing we might consider doing is dropping Essay 3 and Essay 4. They have too many NA values to be useful at this point – at least to my mind. Below, we see that roughly 96% of both columns are blank.
a = sum(is.na(train$project_essay_3))
b = sum(!is.na(train$project_essay_3))
proportion_na_3 = a/(a+b)
100*(proportion_na_3)
## [1] 96.47292
d = sum(is.na(train$project_essay_4))
e = sum(!is.na(train$project_essay_4))
proportion_na_4 = d/(d+e)
100*(proportion_na_4)
## [1] 96.47292
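As an aside, the same proportion falls out of a one-liner, since is.na() returns a logical vector whose mean is the NA fraction (toy vector below):

```r
x <- c("essay text", NA, NA, NA)  # toy column: 3 of 4 entries missing
100 * mean(is.na(x))              # 75
```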
We drop the columns and again glimpse our data to see that we’re all set.
## Observations: 1,081,830
## Variables: 32
## $ id <chr> "p036502", "p036502", "p0395…
## $ school_state <fct> NV, NV, GA, UT, NC, NC, NC, …
## $ project_grade_category <fct> Grades PreK-2, Grades PreK-2…
## $ project_subject_categories <fct> "Literacy & Language", "Lite…
## $ project_subject_subcategories <fct> "Literacy", "Literacy", "Per…
## $ project_title <chr> "Super Sight Word Centers", …
## $ project_essay_1 <chr> "Most of my kindergarten stu…
## $ project_essay_2 <chr> "I currently have a differen…
## $ project_resource_summary <chr> "My students need 6 Ipod Nan…
## $ project_is_approved <lgl> TRUE, TRUE, FALSE, TRUE, FAL…
## $ day <date> 2016-11-18, 2016-11-18, 201…
## $ sex <fct> F, F, F, F, M, M, M, M, M, M…
## $ previous_projects <dbl> 26, 26, 1, 5, 16, 16, 16, 16…
## $ experience <fct> Experienced, Experienced, Be…
## $ special_needs_and_interdisciplinary <fct> Uni-Disciplinary, Uni-Discip…
## $ weekday <fct> Friday, Friday, Wednesday, S…
## $ monthname <fct> 11, 11, 4, 1, 8, 8, 8, 8, 8,…
## $ year <fct> 2016, 2016, 2017, 2017, 2016…
## $ subject_broad <fct> Literacy and Language, Liter…
## $ SubcategoryAverage <dbl> 0.8840571, 0.8840571, 0.7142…
## $ SubcategorySD <dbl> 0.3201666, 0.3201666, 0.4629…
## $ StateAverage <dbl> 0.8567697, 0.8567697, 0.8340…
## $ TotalAmountPerID <dbl> 899.94, 899.94, 400.00, 469.…
## $ MedianAmount <dbl> 449.970, 449.970, 400.000, 4…
## $ MinTotal <dbl> 449.97, 449.97, 400.00, 469.…
## $ MaxTotal <dbl> 449.97, 449.97, 400.00, 469.…
## $ TotalQty <dbl> 6, 6, 20, 1, 5, 5, 5, 5, 5, …
## $ CountItems <int> 2, 2, 1, 1, 5, 5, 5, 5, 5, 1…
## $ MaxPrice <dbl> 149.99, 149.99, 20.00, 469.9…
## $ MeanPrice <dbl> 149.990000, 149.990000, 20.0…
## $ MinPrice <dbl> 149.99, 149.99, 20.00, 469.9…
## $ description <chr> "Apple - iPod nano� 16GB MP3…
As indicated above, some of the inputs, like project_essay_1, are character vectors. In fact, they contain short responses to prompts. Below, we engineer yet more features, ones that tabulate word counts, for example. Because our word-count function is computationally intensive, we conduct word counts on a sample of our data. We then conduct significance testing and use those results to guide development of the relevant features.
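The gregexpr() idiom used below counts runs of alphabetic characters; pulled out as a standalone helper, it behaves as follows:

```r
# count alphabetic 'words' in each element of a character vector;
# gregexpr() returns match positions, or -1 when there is no match
word_count <- function(x) {
  sapply(gregexpr("[[:alpha:]]+", x), function(m) sum(m > 0))
}
word_count("My students need 6 iPod Nanos.")  # 5 (the digit is not counted)
word_count("123")                             # 0
```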
trainingwheelz = trainingwheelz%>%
mutate(project_title_word_count =
sapply(gregexpr("[[:alpha:]]+",
trainingwheelz$project_title),
function(x) sum(x > 0)),
essay_1_wordcount =
sapply(gregexpr("[[:alpha:]]+",
trainingwheelz$project_essay_1),
function(x) sum(x > 0)),
essay_2_wordcount =
sapply(gregexpr("[[:alpha:]]+",
trainingwheelz$project_essay_2),
function(x) sum(x > 0)),
project_resource_summary_wordcount =
sapply(gregexpr("[[:alpha:]]+",
trainingwheelz$project_resource_summary),
function(x) sum(x > 0)),
project_description_count =
sapply(gregexpr("[[:alpha:]]+",
trainingwheelz$description),
function(x) sum(x > 0)))

fit_overall_model = glm(project_is_approved ~
essay_1_wordcount +
essay_2_wordcount +
project_title_word_count +
project_resource_summary_wordcount +
project_description_count,
data = trainingwheelz,
family = binomial)
Anova(fit_overall_model, type = 3, test.statistic = "F")

## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values Pr(>F)
## essay_1_wordcount 9.6 1 9.5620 0.0019920 **
## essay_2_wordcount 159.5 1 158.7052 < 2.2e-16 ***
## project_title_word_count 0.9 1 0.8749 0.3496246
## project_resource_summary_wordcount 11.9 1 11.7938 0.0005967 ***
## project_description_count 0.1 1 0.1064 0.7443275
## Residuals 10045.5 9993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we engineer the same features for the ‘train’ data set (again, a sample).
train = train %>%
mutate(essay_2_wordcount =
sapply(gregexpr("[[:alpha:]]+",
train$project_essay_2),
function(x) sum(x > 0)),
essay_1_wordcount =
sapply(gregexpr("[[:alpha:]]+",
train$project_essay_1),
function(x) sum(x > 0)),
project_title_word_count =
sapply(gregexpr("[[:alpha:]]+",
train$project_title),
function(x) sum(x > 0)),
project_resource_summary_wordcount =
sapply(gregexpr("[[:alpha:]]+",
train$project_resource_summary),
function(x) sum(x > 0)))

## Observations: 30,000
## Variables: 36
## $ id <chr> "p051521", "p029088", "p1037…
## $ school_state <fct> OH, IN, FL, LA, CA, TX, MD, …
## $ project_grade_category <fct> Grades PreK-2, Grades 3-5, G…
## $ project_subject_categories <fct> "Literacy & Language", "Math…
## $ project_subject_subcategories <fct> "Literacy, Literature & Writ…
## $ project_title <chr> "Unique Second Grade Readers…
## $ project_essay_1 <chr> "My job as an interventionis…
## $ project_essay_2 <chr> "This project will have a hu…
## $ project_resource_summary <chr> "My students need guided rea…
## $ project_is_approved <lgl> TRUE, TRUE, FALSE, TRUE, TRU…
## $ day <date> 2016-05-24, 2016-09-01, 201…
## $ sex <fct> F, F, F, F, F, F, F, F, F, F…
## $ previous_projects <dbl> 1, 0, 3, 0, 23, 1, 0, 7, 1, …
## $ experience <fct> Beginner, Novice, Experience…
## $ special_needs_and_interdisciplinary <fct> Uni-Disciplinary, Uni-Discip…
## $ weekday <fct> Tuesday, Thursday, Friday, S…
## $ monthname <fct> 5, 9, 8, 8, 8, 10, 8, 3, 10,…
## $ year <fct> 2016, 2016, 2016, 2016, 2016…
## $ subject_broad <fct> Literacy and Language, Music…
## $ SubcategoryAverage <dbl> 0.8724413, 0.8187147, 0.8389…
## $ SubcategorySD <dbl> 0.3336158, 0.3852757, 0.3678…
## $ StateAverage <dbl> 0.8714665, 0.8479369, 0.8245…
## $ TotalAmountPerID <dbl> 1112.03, 282.86, 442.13, 927…
## $ MedianAmount <dbl> 39.990, 30.010, 29.990, 49.9…
## $ MinTotal <dbl> 4.21, 27.85, 4.99, 24.90, 24…
## $ MaxTotal <dbl> 149.00, 225.00, 99.80, 479.0…
## $ TotalQty <dbl> 30, 11, 38, 42, 45, 30, 37, …
## $ CountItems <int> 25, 3, 13, 6, 7, 30, 10, 22,…
## $ MaxPrice <dbl> 149.00, 30.01, 39.99, 479.00…
## $ MeanPrice <dbl> 38.634400, 27.620000, 20.567…
## $ MinPrice <dbl> 4.21, 25.00, 4.99, 2.49, 5.6…
## $ description <chr> "TT168 - Level H Book Bin - …
## $ essay_2_wordcount <int> 137, 104, 104, 104, 144, 109…
## $ essay_1_wordcount <int> 99, 169, 83, 133, 104, 96, 1…
## $ project_title_word_count <int> 4, 4, 3, 6, 3, 7, 9, 6, 9, 4…
## $ project_resource_summary_wordcount <int> 32, 28, 12, 16, 25, 33, 12, …
We now load libraries useful for textual analysis. Note that the analysis below is very rudimentary; a more thorough textual analysis (e.g., tf-idf), more directly related to predictive modeling, will be conducted near the close of this project. Again, we use a sample from our data set because the method employed below is computationally intensive. Also, it is important to note that the example below deals only with the feature ‘project_resource_summary,’ but the method can easily be extended to other character features (e.g., project_essay_1), as I have done, and as is apparent below.
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(ggplot2)
library(tidytext)
library(reshape)

Note that the data below is again partitioned by approval. As will be seen, this analysis results in word frequency distributions unique to each group. Keywords – those that distinguish successful applications from unsuccessful applications – will then be used to engineer yet more features.
applications_approved = filter(z, project_is_approved > 0)
app_resources_requested_approved=applications_approved$project_resource_summary
app_resources_requested_approved=as.character(app_resources_requested_approved)
app_resources_requested_approved_text <- VectorSource(app_resources_requested_approved)
corpus <- VCorpus(app_resources_requested_approved_text)
docs = corpus
docs <- tm_map(docs,removePunctuation)
for (j in seq(docs)) {
docs[[j]] <- gsub("/", " ", docs[[j]])
docs[[j]] <- gsub("@", " ", docs[[j]])
docs[[j]] <- gsub("\\|", " ", docs[[j]])
docs[[j]] <- gsub("\u2028", " ", docs[[j]])
}
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, PlainTextDocument)
docs <- tm_map(docs, removeWords, stopwords("english") )
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm1 <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm1)), decreasing=TRUE)
project_summary_approved_keywords <- data.frame(word=names(freq), freq=freq)

Below is a textual analysis of denied applications.
applications_denied = filter(z, project_is_approved < 1)
app_resources_requested_denied=applications_denied$project_resource_summary
app_resources_requested_denied=as.character(app_resources_requested_denied)
app_resources_requested_denied_text <- VectorSource(app_resources_requested_denied)
corpus <- VCorpus(app_resources_requested_denied_text)
docs = corpus
docs <- tm_map(docs,removePunctuation)
for (j in seq(docs)) {
docs[[j]] <- gsub("/", " ", docs[[j]])
docs[[j]] <- gsub("@", " ", docs[[j]])
docs[[j]] <- gsub("\\|", " ", docs[[j]])
docs[[j]] <- gsub("\u2028", " ", docs[[j]])
}
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, PlainTextDocument)
docs <- tm_map(docs, removeWords, stopwords("english") )
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
project_summary_denied_keywords <- data.frame(word=names(freq), freq=freq)

We remove the two terms with the highest frequency, as they are uninstructive and common to both groups, projects approved and denied.
Below is a visualization of word frequency, from projects approved. Note that these are the words most commonly used in successful applications, with regard to project_resource_summary.
p <- ggplot(subset(project_summary_approved_keywords, freq>300),
aes(x = reorder(word, -freq), y = freq, fill = freq)) +
geom_bar(stat = "identity") +
labs(title="Word Frequency Analysis",
subtitle= "Keyword Analysis: Approved") +
labs(x = NULL, y = NULL, fill= "Frequency") +
theme(plot.subtitle = element_text(vjust = 1),
plot.caption = element_text(vjust = 1),
axis.text.x = element_text(angle = 90))
p

Again, we remove the two terms with the highest frequency, as they are uninstructive and common to both groups, approved and denied.
project_summary_denied_keywords = project_summary_denied_keywords[-(1:2), ]

Below is yet another visualization of word frequency, but from projects denied. Note that these are the words most commonly employed in unsuccessful applications, with regard to project_resource_summary.
p <- ggplot(subset(project_summary_denied_keywords, freq>100),
aes(x = reorder(word, -freq), y = freq, fill = freq)) +
geom_bar(stat = "identity") +
labs(title="Word Frequency Analysis",
subtitle= "Keyword Analysis: Denied") +
labs(x = NULL, y = NULL, fill= "Frequency") +
theme(plot.subtitle = element_text(vjust = 1),
plot.caption = element_text(vjust = 1),
axis.text.x = element_text(angle = 90))
p

Here we use the arsenal package to tabulate the keywords unique to each group.
#-- arsenal for keyword comparison
library(arsenal)
top_40_project_summary_denied_keywords = head(project_summary_denied_keywords, 40)
top_40_project_summary_denied_keywords = top_40_project_summary_denied_keywords %>%
select(word)
top_40_project_summary_approved_keywords = head(project_summary_approved_keywords, 40)
top_40_project_summary_approved_keywords = top_40_project_summary_approved_keywords %>%
select(word)
obj = summary(comparedf(top_40_project_summary_denied_keywords,
top_40_project_summary_approved_keywords, by = "word"))
y = obj$obs.table
y

## version word observation
## 2 x also 39
## 5 x basic 33
## 14 x environment 40
## 15 x equipment 27
## 16 x fun 36
## 18 x hands 31
## 34 x play 34
## 43 x technology 35
## 44 x time 37
## 45 x tools 30
## 51 x year 38
## 4 y balls 37
## 6 y book 34
## 9 y chairs 32
## 13 y engaging 33
## 21 y keep 26
## 24 y library 11
## 26 y love 31
## 28 y markers 39
## 35 y read 13
## 36 y readers 40
## 50 y writing 36
Below is a word cloud for approved projects.
set.seed(1234)
wordcloud(words = project_summary_approved_keywords$word, freq = project_summary_approved_keywords$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

Below is yet another word cloud, but for denied projects. Viewed in conjunction with the word cloud for approved projects, it is clear that approved projects and denied projects contain different keywords. This observation allows us to engineer yet more features, as seen below.
wordcloud(words = project_summary_denied_keywords$word, freq = project_summary_denied_keywords$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

Continued below is a brief look at topic modeling. Essentially, we want to differentiate themes or topics implicit in, in this case, the project_resource_summary feature. The assumption driving this analysis is that groupings exist of which we are currently unaware. For example, maybe many teachers are concerned with securing funding to purchase new technologies; perhaps other teachers are concerned with funding for field trips. We can use these groupings, once discovered, to partition and further analyze our data. Said groupings, it is thought, are potentially more meaningful than other, somewhat more arbitrary groupings, like geographical location. Whatever the case may be, topic modeling will likely help us better understand our data.
We load libraries, use the document term matrices from above, and plot the results.
library(topicmodels)  # provides LDA()

topics_lda <- LDA(dtm, k = 8, control = list(seed = 1234))
typical_topics <- tidy(topics_lda, matrix = "beta")

top_terms <- typical_topics %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()

As it stands, our topic modeling analyses would appear somewhat anticlimactic and uninstructive: the groupings are far too similar to be useful. Bummer. (I plan to revisit this and expand.)
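One quick way to substantiate the impression that the topics overlap too heavily is a pairwise term-overlap (Jaccard) score on each topic's top terms; a toy sketch with made-up term sets:

```r
# Jaccard similarity between two topics' top-term sets; values near 1
# indicate near-duplicate topics. The term sets below are illustrative,
# not taken from the fitted LDA model.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

topic_1 <- c("students", "learning", "classroom", "school")
topic_2 <- c("students", "learning", "reading", "school")
jaccard(topic_1, topic_2)
#> [1] 0.6
```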
Prior to predictive modeling, we create the last of our features: critical keyword counts. As alluded to above, the arsenal package (note that this package causes Kaggle to throw a fit) allows us to compare data frames in terms of uniqueness. Accordingly, we compare word frequencies for projects approved and denied. We are left with a new data frame, one including high-use terms exclusive to approval and denial. We then use the word count function, as seen below, to count each keyword and derive a cumulative relevancy score. It is undoubtedly inelegant, cumbersome code, but it gets the job done, if slowly. (I plan to revisit and refine this…)
train = train%>%
mutate(keyword1_project_resource_summary = sapply(gregexpr("able",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword2_project_resource_summary = sapply(gregexpr("art",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword3_project_resource_summary = sapply(gregexpr("equipment",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword4_project_resource_summary = sapply(gregexpr("games",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword5_project_resource_summary = sapply(gregexpr("hands",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword6_project_resource_summary = sapply(gregexpr("items",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword7_project_resource_summary = sapply(gregexpr("options",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword8_project_resource_summary = sapply(gregexpr("projects",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword9_project_resource_summary = sapply(gregexpr("stem",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword10_project_resource_summary = sapply(gregexpr("variety",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword11_project_resource_summary = sapply(gregexpr("balls",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword12_project_resource_summary = sapply(gregexpr("chromebooks",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword13_project_resource_summary = sapply(gregexpr("headphones",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword14_project_resource_summary = sapply(gregexpr("ipad",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword15_project_resource_summary = sapply(gregexpr("keep",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword16_project_resource_summary = sapply(gregexpr("set",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword17_project_resource_summary = sapply(gregexpr("time",
train$project_resource_summary),
function(x) sum(x > 0)),
keyword18_project_resource_summary = sapply(gregexpr("wobble",
train$project_resource_summary),
function(x) sum(x > 0)),
overall_relevancy_score_for_project_resource_summary1 =
keyword1_project_resource_summary +
keyword2_project_resource_summary +
keyword3_project_resource_summary +
keyword4_project_resource_summary +
keyword5_project_resource_summary +
keyword6_project_resource_summary +
keyword7_project_resource_summary +
keyword8_project_resource_summary +
keyword9_project_resource_summary +
keyword10_project_resource_summary,
overall_relevancy_score_for_project_resource_summary2 =
keyword11_project_resource_summary +
keyword12_project_resource_summary +
keyword13_project_resource_summary +
keyword14_project_resource_summary +
keyword15_project_resource_summary +
keyword16_project_resource_summary +
keyword17_project_resource_summary +
keyword18_project_resource_summary)

A bit of significance testing to round our analysis out.
fit_resource_summary5 = glm(project_is_approved ~
overall_relevancy_score_for_project_resource_summary1 +
overall_relevancy_score_for_project_resource_summary2,
data = train, family = binomial)
Anova(fit_resource_summary5, type = 3, test.statistic = "F")

## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df
## overall_relevancy_score_for_project_resource_summary1 212.6 1
## overall_relevancy_score_for_project_resource_summary2 30.5 1
## Residuals 29965.5 29997
## F values Pr(>F)
## overall_relevancy_score_for_project_resource_summary1 212.84 < 2.2e-16
## overall_relevancy_score_for_project_resource_summary2 30.58 3.231e-08
## Residuals
##
## overall_relevancy_score_for_project_resource_summary1 ***
## overall_relevancy_score_for_project_resource_summary2 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, it would seem that our relevancy score for the Project Resource Summary is a significant predictor of the likelihood of project approval, and for that reason, it may very well facilitate the development of more accurate models.
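Incidentally, the mutate() block above spells out one term per keyword. The same counts and score can be computed with a single helper looped over a keyword vector; a sketch with an illustrative keyword subset (note that, as in the original code, plain substring matching means "art" would also match words like "smart"):

```r
# Count each keyword per document, then sum across keywords to get a
# relevancy score. fixed = TRUE treats the keywords as literal strings.
keyword_counts <- function(texts, words) {
  sapply(words, function(w)
    sapply(gregexpr(w, texts, fixed = TRUE),
           function(m) sum(m > 0)))
}

keywords <- c("able", "art", "equipment", "games", "hands")
texts <- c("art equipment for hands-on games", "we were able to read")

counts <- keyword_counts(texts, keywords)   # documents x keywords matrix
relevancy_score <- rowSums(counts)
relevancy_score
#> [1] 4 1
```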
We now turn to our relevancy score for Essay 2.
train = train%>%
mutate(keyword1_count_essay_2 = sapply(gregexpr("activities",
train$project_essay_2),
function(x) sum(x > 0)),
keyword2_count_essay_2 = sapply(gregexpr("children",
train$project_essay_2),
function(x) sum(x > 0)),
keyword3_count_essay_2 = sapply(gregexpr("science",
train$project_essay_2),
function(x) sum(x > 0)),
keyword4_count_essay_2 = sapply(gregexpr("supplies",
train$project_essay_2),
function(x) sum(x > 0)),
keyword5_count_essay_2 = sapply(gregexpr("way",
train$project_essay_2),
function(x) sum(x > 0)),
keyword6_count_essay_2 = sapply(gregexpr("access",
train$project_essay_2),
function(x) sum(x > 0)),
keyword7_count_essay_2 = sapply(gregexpr("one",
train$project_essay_2),
function(x) sum(x > 0)),
keyword8_count_essay_2 = sapply(gregexpr("read",
train$project_essay_2),
function(x) sum(x > 0)),
keyword9_count_essay_2 = sapply(gregexpr("used",
train$project_essay_2),
function(x) sum(x > 0)),
keyword10_count_essay_2 = sapply(gregexpr("using",
train$project_essay_2),
function(x) sum(x > 0)),
overall_relevancy_score_for_project_essay_2 =
keyword1_count_essay_2 +
keyword2_count_essay_2 +
keyword3_count_essay_2 +
keyword4_count_essay_2 +
keyword5_count_essay_2,
overall_relevancy_score_for_project_essay_2.1 =
keyword6_count_essay_2 +
keyword7_count_essay_2 +
keyword8_count_essay_2 +
keyword9_count_essay_2 +
keyword10_count_essay_2)

And a bit more significance testing.
fit_project_essay_2.5 = glm(project_is_approved ~
overall_relevancy_score_for_project_essay_2 +
overall_relevancy_score_for_project_essay_2.1,
data = train, family = binomial)
Anova(fit_project_essay_2.5, type = 3, test.statistic = "F")

## Analysis of Deviance Table (Type III tests)
##
## Response: project_is_approved
## Error estimate based on Pearson residuals
##
## Sum Sq Df F values
## overall_relevancy_score_for_project_essay_2 57.9 1 57.331
## overall_relevancy_score_for_project_essay_2.1 724.3 1 716.634
## Residuals 30317.5 29997
## Pr(>F)
## overall_relevancy_score_for_project_essay_2 3.788e-14 ***
## overall_relevancy_score_for_project_essay_2.1 < 2.2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, from the results above, it would seem that our relevancy score for Essay 2 is a significant predictor of the likelihood of project approval, and for that reason, it may very well facilitate the development of more accurate models.
Finally, we take another peek at our data to make sure it looks okay, and we create one last feature. Note that we are working with a smallish sample of our data, but at this point, the aim of this kernel is to develop a baseline AUC score. In other words, because we want to reduce computing intensity, we’re working with a smaller sample, developing a model we think might work, and then, when we’re satisfied with our results, we can let loose on the data set in its entirety. There are many ways of going about this; this is just one approach, being rethought and reexamined as we knit.
Now that we’ve completed our initial exploration, we continue with modeling to establish a baseline AUC. Once a baseline is established, we will refine our model to improve accuracy. First, we clean our data set, removing redundant features (e.g., school_state, project_grade_category).
train = train %>%
select(-id,
-school_state,
-project_grade_category,
-project_subject_categories,
-project_subject_subcategories,
-project_title,
-project_essay_1,
-project_essay_2,
-project_resource_summary,
-day)
glimpse(train)

## Observations: 30,000
## Variables: 59
## $ project_is_approved <lgl> TRUE, TRUE…
## $ sex <fct> F, F, F, F…
## $ previous_projects <dbl> 1, 0, 3, 0…
## $ experience <fct> Beginner, …
## $ special_needs_and_interdisciplinary <fct> Uni-Discip…
## $ weekday <fct> Tuesday, T…
## $ monthname <fct> 5, 9, 8, 8…
## $ year <fct> 2016, 2016…
## $ subject_broad <fct> Literacy a…
## $ SubcategoryAverage <dbl> 0.8724413,…
## $ SubcategorySD <dbl> 0.3336158,…
## $ StateAverage <dbl> 0.8714665,…
## $ TotalAmountPerID <dbl> 1112.03, 2…
## $ MedianAmount <dbl> 39.990, 30…
## $ MinTotal <dbl> 4.21, 27.8…
## $ MaxTotal <dbl> 149.00, 22…
## $ TotalQty <dbl> 30, 11, 38…
## $ CountItems <int> 25, 3, 13,…
## $ MaxPrice <dbl> 149.00, 30…
## $ MeanPrice <dbl> 38.634400,…
## $ MinPrice <dbl> 4.21, 25.0…
## $ description <chr> "TT168 - L…
## $ essay_2_wordcount <int> 137, 104, …
## $ essay_1_wordcount <int> 99, 169, 8…
## $ project_title_word_count <int> 4, 4, 3, 6…
## $ project_resource_summary_wordcount <int> 32, 28, 12…
## $ keyword1_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword2_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword3_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword4_project_resource_summary <int> 1, 0, 0, 0…
## $ keyword5_project_resource_summary <int> 0, 0, 1, 0…
## $ keyword6_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword7_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword8_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword9_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword10_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword11_project_resource_summary <int> 0, 1, 0, 0…
## $ keyword12_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword13_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword14_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword15_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword16_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword17_project_resource_summary <int> 0, 0, 0, 0…
## $ keyword18_project_resource_summary <int> 0, 0, 0, 0…
## $ overall_relevancy_score_for_project_resource_summary1 <int> 1, 0, 1, 0…
## $ overall_relevancy_score_for_project_resource_summary2 <int> 0, 1, 0, 0…
## $ keyword1_count_essay_2 <int> 0, 0, 1, 0…
## $ keyword2_count_essay_2 <int> 0, 5, 0, 0…
## $ keyword3_count_essay_2 <int> 0, 0, 1, 0…
## $ keyword4_count_essay_2 <int> 0, 0, 1, 1…
## $ keyword5_count_essay_2 <int> 0, 1, 0, 0…
## $ keyword6_count_essay_2 <int> 0, 0, 0, 0…
## $ keyword7_count_essay_2 <int> 0, 0, 0, 1…
## $ keyword8_count_essay_2 <int> 3, 0, 1, 0…
## $ keyword9_count_essay_2 <int> 2, 0, 1, 1…
## $ keyword10_count_essay_2 <int> 1, 0, 0, 0…
## $ overall_relevancy_score_for_project_essay_2 <int> 0, 6, 3, 1…
## $ overall_relevancy_score_for_project_essay_2.1 <int> 6, 0, 2, 2…
## $ CategoryAverage <dbl> 0.8548188,…
Everything looks okay, although the introduction of keyword counts hints at a sparse matrix – certainly a potential problem. As mentioned, once the model seems to perform optimally on our sample, we’ll come back and give it a go on the entirety of the data set. Below we load the caret package for modeling and the pROC package for AUC analysis. We also load the caretEnsemble package, as it allows us to run more than one model at a time. Depending on model performance – more specifically, depending on how the models correlate – we might think to ensemble the models to increase predictive performance.
We now partition our data into training and testing sets. Note that because we’re merely attempting to establish a baseline AUC, we’ll jump right in, although preprocessing would seem desirable. (We also need to set our target as a factor.)
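Converting the logical target is a one-liner. One wrinkle worth noting: caret’s classProbs = TRUE requires factor levels that are syntactically valid R names, which is why the classes print as ‘FALSE.’ and ‘TRUE.’ in the model summary further down. A sketch on a toy vector:

```r
# make.names() appends "." to the reserved words TRUE/FALSE, producing
# factor levels caret's classProbs/twoClassSummary machinery accepts.
approved <- c(TRUE, TRUE, FALSE)   # toy stand-in for project_is_approved
target <- factor(make.names(approved))
levels(target)
#> [1] "FALSE." "TRUE."
```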
Here we partition our data.
set.seed(3147)
inTrain <- createDataPartition(y=trainer$project_is_approved,
p=0.70, list=FALSE)
training <- trainer[inTrain,]
testing <- trainer[-inTrain,]

We set a few rules, and then we have at it.
tcontrol <- trainControl(method = "cv",
number = 3,
index = createFolds(training$project_is_approved, 3),
savePredictions = "final",
verbose = FALSE,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(3147)
first_model <- train(project_is_approved ~
MedianAmount +
essay_2_wordcount +
TotalAmountPerID +
previous_projects +
CountItems +
keyword6_project_resource_summary +
SubcategoryAverage +
keyword9_count_essay_2 +
keyword4_count_essay_2 +
StateAverage +
keyword3_project_resource_summary +
keyword8_count_essay_2 +
keyword5_project_resource_summary +
keyword18_project_resource_summary +
project_title_word_count +
keyword10_project_resource_summary +
keyword13_project_resource_summary +
essay_1_wordcount +
CategoryAverage +
keyword4_project_resource_summary +
project_resource_summary_wordcount +
keyword6_count_essay_2 +
keyword2_count_essay_2 +
MeanPrice +
MaxPrice,
data = training,
na.action = na.omit,
trControl = tcontrol,
metric="ROC",
method = "glm")

Below are the results of our first model.
## Generalized Linear Model
##
## 14001 samples
## 25 predictor
## 2 classes: 'FALSE.', 'TRUE.'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 4667, 4667, 4667
## Resampling results:
##
## ROC Sens Spec
## 0.7039846 0.04689932 0.9851982
Here is a plot of feature importance. As suggested, many of the features we engineered seem to be working well.
Having thus run our first model, we now predict on our ‘testing’ partition. Below are the results.
predictions <- predict(first_model, newdata=testing, type = "prob")
pred = c(predictions)
obs = c(testing$project_is_approved)
compare = data.frame(pred,obs)

Below we see an AUC of 0.69. This is okay, not great. Perhaps if we use other models we can improve our baseline.
##
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
##
## Data: pred in 1251 controls (obs 1) > 4748 cases (obs 2).
## Area under the curve: 0.6924
Here we load libraries and fit various models at once. (Note the algorithmList.)
traincontrol <- trainControl(method = "cv",
number = 3,
index = createFolds(training$project_is_approved, 3),
savePredictions = "final",
verbose = FALSE,
classProbs = TRUE,
summaryFunction = twoClassSummary)
algorithmList <- c('xgbTree', 'gbm', 'earth')

set.seed(3147)
zero_stack <- caretList(project_is_approved ~
MedianAmount +
essay_2_wordcount +
TotalAmountPerID +
previous_projects +
CountItems +
keyword6_project_resource_summary +
SubcategoryAverage +
keyword9_count_essay_2 +
keyword4_count_essay_2 +
StateAverage +
keyword3_project_resource_summary +
keyword8_count_essay_2 +
keyword5_project_resource_summary +
keyword18_project_resource_summary +
project_title_word_count +
keyword10_project_resource_summary +
keyword13_project_resource_summary +
essay_1_wordcount +
CategoryAverage +
keyword4_project_resource_summary +
project_resource_summary_wordcount +
keyword6_count_essay_2 +
keyword2_count_essay_2 +
MeanPrice +
MaxPrice,
data = training,
trControl = traincontrol,
metric="ROC",
methodList=algorithmList)

## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0196 nan 0.1000 0.0022
## 2 1.0155 nan 0.1000 0.0019
## 3 1.0122 nan 0.1000 0.0014
## 4 1.0092 nan 0.1000 0.0014
## 5 1.0067 nan 0.1000 0.0011
## 6 1.0039 nan 0.1000 0.0006
## 7 1.0012 nan 0.1000 0.0012
## 8 0.9984 nan 0.1000 0.0011
## 9 0.9965 nan 0.1000 0.0004
## 10 0.9941 nan 0.1000 0.0009
## 20 0.9734 nan 0.1000 0.0006
## 40 0.9484 nan 0.1000 -0.0002
## 60 0.9320 nan 0.1000 0.0001
## 80 0.9228 nan 0.1000 -0.0000
## 100 0.9146 nan 0.1000 0.0000
## 120 0.9084 nan 0.1000 -0.0000
## 140 0.9035 nan 0.1000 0.0001
## 150 0.9011 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0174 nan 0.1000 0.0028
## 2 1.0093 nan 0.1000 0.0033
## 3 1.0033 nan 0.1000 0.0028
## 4 0.9985 nan 0.1000 0.0024
## 5 0.9942 nan 0.1000 0.0014
## 6 0.9890 nan 0.1000 0.0022
## 7 0.9838 nan 0.1000 0.0018
## 8 0.9807 nan 0.1000 0.0007
## 9 0.9767 nan 0.1000 0.0015
## 10 0.9732 nan 0.1000 0.0009
## 20 0.9488 nan 0.1000 0.0001
## 40 0.9181 nan 0.1000 0.0003
## 60 0.9037 nan 0.1000 -0.0000
## 80 0.8922 nan 0.1000 -0.0002
## 100 0.8829 nan 0.1000 -0.0002
## 120 0.8750 nan 0.1000 -0.0001
## 140 0.8683 nan 0.1000 -0.0005
## 150 0.8663 nan 0.1000 -0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0139 nan 0.1000 0.0044
## 2 1.0030 nan 0.1000 0.0051
## 3 0.9961 nan 0.1000 0.0029
## 4 0.9880 nan 0.1000 0.0036
## 5 0.9803 nan 0.1000 0.0032
## 6 0.9751 nan 0.1000 0.0020
## 7 0.9702 nan 0.1000 0.0019
## 8 0.9654 nan 0.1000 0.0017
## 9 0.9613 nan 0.1000 0.0015
## 10 0.9570 nan 0.1000 0.0013
## 20 0.9298 nan 0.1000 0.0005
## 40 0.9016 nan 0.1000 -0.0001
## 60 0.8846 nan 0.1000 -0.0003
## 80 0.8689 nan 0.1000 -0.0003
## 100 0.8578 nan 0.1000 -0.0002
## 120 0.8467 nan 0.1000 -0.0003
## 140 0.8361 nan 0.1000 -0.0004
## 150 0.8308 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0208 nan 0.1000 0.0015
## 2 1.0159 nan 0.1000 0.0022
## 3 1.0131 nan 0.1000 0.0005
## 4 1.0087 nan 0.1000 0.0019
## 5 1.0050 nan 0.1000 0.0014
## 6 1.0013 nan 0.1000 0.0015
## 7 0.9986 nan 0.1000 0.0011
## 8 0.9957 nan 0.1000 0.0010
## 9 0.9935 nan 0.1000 0.0007
## 10 0.9907 nan 0.1000 0.0015
## 20 0.9682 nan 0.1000 0.0005
## 40 0.9400 nan 0.1000 0.0001
## 60 0.9222 nan 0.1000 0.0004
## 80 0.9103 nan 0.1000 -0.0000
## 100 0.9019 nan 0.1000 -0.0001
## 120 0.8962 nan 0.1000 -0.0001
## 140 0.8915 nan 0.1000 -0.0003
## 150 0.8893 nan 0.1000 -0.0003
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0150 nan 0.1000 0.0038
## 2 1.0087 nan 0.1000 0.0030
## 3 1.0021 nan 0.1000 0.0028
## 4 0.9956 nan 0.1000 0.0027
## 5 0.9901 nan 0.1000 0.0025
## 6 0.9850 nan 0.1000 0.0021
## 7 0.9801 nan 0.1000 0.0018
## 8 0.9744 nan 0.1000 0.0027
## 9 0.9702 nan 0.1000 0.0014
## 10 0.9656 nan 0.1000 0.0017
## 20 0.9346 nan 0.1000 0.0007
## 40 0.9042 nan 0.1000 0.0002
## 60 0.8880 nan 0.1000 -0.0001
## 80 0.8766 nan 0.1000 -0.0001
## 100 0.8679 nan 0.1000 -0.0001
## 120 0.8596 nan 0.1000 -0.0003
## 140 0.8519 nan 0.1000 -0.0002
## 150 0.8491 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0155 nan 0.1000 0.0037
## 2 1.0074 nan 0.1000 0.0030
## 3 1.0006 nan 0.1000 0.0022
## 4 0.9910 nan 0.1000 0.0044
## 5 0.9826 nan 0.1000 0.0039
## 6 0.9761 nan 0.1000 0.0023
## 7 0.9705 nan 0.1000 0.0019
## 8 0.9644 nan 0.1000 0.0024
## 9 0.9582 nan 0.1000 0.0030
## 10 0.9521 nan 0.1000 0.0026
## 20 0.9193 nan 0.1000 0.0000
## 40 0.8897 nan 0.1000 -0.0001
## 60 0.8713 nan 0.1000 -0.0002
## 80 0.8558 nan 0.1000 0.0001
## 100 0.8438 nan 0.1000 -0.0003
## 120 0.8318 nan 0.1000 -0.0001
## 140 0.8219 nan 0.1000 -0.0001
## 150 0.8175 nan 0.1000 -0.0002
## (training traces for the remaining gbm resamples omitted)
Our resampling results:
##
## Call:
## summary.resamples(object = zero_stack_results)
##
## Models: xgbTree, gbm, earth
## Number of resamples: 3
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## xgbTree 0.7156928 0.7157255 0.7157581 0.7159302 0.7160489 0.7163398 0
## gbm 0.7154749 0.7163792 0.7172835 0.7169955 0.7177559 0.7182282 0
## earth 0.7006609 0.7028412 0.7050215 0.7043107 0.7061356 0.7072497 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## xgbTree 0.1078029 0.11219625 0.11658963 0.12359051 0.1314843 0.14637904
## gbm 0.1350796 0.13687725 0.13867488 0.13779496 0.1391526 0.13963039
## earth 0.0672830 0.07342589 0.07956879 0.08130807 0.0883206 0.09707242
## NA's
## xgbTree 0
## gbm 0
## earth 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## xgbTree 0.9656153 0.9668315 0.9680477 0.9672834 0.9681175 0.9681874 0
## gbm 0.9619602 0.9635146 0.9650690 0.9654783 0.9672374 0.9694057 0
## earth 0.9715717 0.9748206 0.9780696 0.9767150 0.9792866 0.9805037 0
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(zero_stack_results, scales = scales)
And finally, we predict on our testing partition.
## xgbTree gbm earth
## [1,] 0.7868816 0.8976275 0.9481296
## [2,] 0.9733845 0.9479112 0.9770617
## [3,] 0.8692275 0.8824792 0.8218733
## [4,] 0.8960258 0.8691519 0.8419912
## [5,] 0.9068097 0.9281892 0.9567983
## [6,] 0.6885239 0.6901603 0.6288690
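The `compare` data frame holding these predictions is not shown being built. A minimal, self-contained sketch of the pattern, using caret on a built-in two-class data set as a stand-in for the Donors Choose features (the object names `train_part`, `test_part`, and `fit_gbm` are illustrative, not from the original analysis; the real `compare` has one probability column per model, for xgbTree, gbm, and earth):

```r
library(caret)   # train(), createDataPartition(), trainControl()
library(dplyr)   # %>%, filter(), mutate()

# Stand-in two-class data set (the real analysis uses the Donors Choose features)
two_class <- iris %>%
  filter(Species != "setosa") %>%
  mutate(Species = droplevels(Species))

set.seed(3147)
idx        <- createDataPartition(two_class$Species, p = 0.7, list = FALSE)
train_part <- two_class[idx, ]
test_part  <- two_class[-idx, ]

# Cross-validated fit with class probabilities, optimizing ROC
ctrl <- trainControl(method = "cv", number = 3, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fit_gbm <- train(Species ~ ., data = train_part, method = "gbm",
                 trControl = ctrl, metric = "ROC", verbose = FALSE)

# Bind each model's predicted probability of the positive class to the
# observed outcome; the same pattern repeats for the xgbTree and earth fits.
compare <- data.frame(
  gbm = predict(fit_gbm, test_part, type = "prob")[, "versicolor"],
  obs = test_part$Species
)
head(compare)
```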
xgbTree_results <- compare %>%
  select(xgbTree, obs)
pred <- xgbTree_results$xgbTree
obs <- xgbTree_results$obs
set.seed(3147)
roc(obs, pred, plot = TRUE)
##
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
##
## Data: pred in 1251 controls (obs FALSE.) < 4748 cases (obs TRUE.).
## Area under the curve: 0.716
gbm_results <- compare %>%
  select(gbm, obs)
pred <- gbm_results$gbm
obs <- gbm_results$obs
set.seed(3147)
roc(obs, pred, plot = TRUE)
##
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
##
## Data: pred in 1251 controls (obs FALSE.) < 4748 cases (obs TRUE.).
## Area under the curve: 0.711
earth_results <- compare %>%
  select(earth, obs)
pred <- earth_results$earth
obs <- earth_results$obs
set.seed(3147)
roc(obs, pred, plot = TRUE)
##
## Call:
## roc.default(response = obs, predictor = pred, plot = TRUE)
##
## Data: pred in 1251 controls (obs FALSE.) < 4748 cases (obs TRUE.).
## Area under the curve: 0.699
Because the models' predictions are so strongly correlated, building an ensemble on top of them is unlikely to add much. At this point, the gbm model performs best in terms of AUC (roughly 0.72). We use this as a benchmark and will return at a later date to improve predictive performance.
## xgbTree gbm earth
## xgbTree 1.0000000 -0.9063843 0.2729696
## gbm -0.9063843 1.0000000 0.1589952
## earth 0.2729696 0.1589952 1.0000000
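A matrix like the one above is typically produced with `caret::modelCor()` on the resamples object (i.e., `modelCor(zero_stack_results)`), which correlates the models' fold-by-fold ROC values; with only three resamples, those correlations are extremely noisy, which may explain the strong negative xgbTree–gbm entry. As a runnable illustration of the same idea, we can correlate the models' predicted probabilities directly, using the six test-set rows printed earlier (re-entered by hand, so this small matrix will differ from the one above):

```r
# modelCor(zero_stack_results)  # presumed source of the matrix above

# Illustrative alternative: correlate predicted probabilities directly,
# using the six test-set predictions printed earlier.
compare_head <- data.frame(
  xgbTree = c(0.7868816, 0.9733845, 0.8692275, 0.8960258, 0.9068097, 0.6885239),
  gbm     = c(0.8976275, 0.9479112, 0.8824792, 0.8691519, 0.9281892, 0.6901603),
  earth   = c(0.9481296, 0.9770617, 0.8218733, 0.8419912, 0.9567983, 0.6288690)
)
round(cor(compare_head), 4)
```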